The Lancet Digital Health
Elsevier BV
All preprints, ranked by how well they match The Lancet Digital Health's content profile, based on 25 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Datta, A.; Rozwadowski, P.; Broadhurst, P.; Evison, M.; Sharman, A.; Bramley, R.
Objectives: To assess whether an artificial intelligence (AI) chest radiograph (CXR) tool could enhance lung cancer detection on primary care-referred CXRs in the UK, and to estimate the magnitude of any improvement. Methods: From ~280,000 primary care-referred CXRs, we identified 1,600 linked to a lung cancer diagnosis (ICD-10 C34) within six months. Missed lung cancers were defined by review of the CXR report and comparison of diagnostic CT and positron emission tomography (PET) imaging with the index CXR by three specialist radiology clinicians. CXRs with a retrospectively visible but initially missed cancer were re-analysed using a commercially available AI tool. The primary outcome was the enhanced detection rate (EDR), defined as the proportion of confirmed cancers missed on CXR but correctly identified by AI. Results: Of 1,600 CXRs, 105 (6.6%) contained a retrospectively visible cancer that had been missed at first report. AI flagged abnormalities in 72/105 (69%) and delineated the primary tumour in 38/105 (36%). This equated to an absolute EDR of 2.4% and a relative EDR of 2.9%. Missed lesions were concentrated in central and upper zones, whereas AI detections were more frequent in peripheral locations. Conclusions: AI identified over one-third of retrospectively visible lung cancers that were missed at initial CXR reporting. Implementation as decision support could provide a modest but potentially meaningful increase in lung cancer detection in primary care. Advances in knowledge: In this large real-world UK study, AI modestly improved lung cancer detection on CXR, with complementary detection patterns to human readers, but performance remained limited in anatomically complex regions.
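A quick, hedged sketch of the enhanced detection rate (EDR) arithmetic quoted above, using only the counts given in the abstract; the denominator of the relative EDR (cancers detected at initial reporting) is not stated here and is left symbolic.

    # Sketch of the EDR arithmetic from the abstract's counts; not the authors' code.
    confirmed_cancers = 1600   # CXRs linked to a lung cancer diagnosis within six months
    ai_delineated = 38         # initially missed cancers whose primary tumour the AI delineated

    absolute_edr = ai_delineated / confirmed_cancers
    print(f"absolute EDR = {absolute_edr:.1%}")   # ~2.4%, matching the reported value

    # The relative EDR (2.9%) uses the number of cancers detected at initial reporting
    # as its denominator; that count is not given in the abstract, so it stays symbolic:
    # relative_edr = ai_delineated / cancers_detected_at_first_report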
Doroodgar Jorshery, S.; Chandra, J.; Walia, A.; Sturniolo, A.; Corey, K.; Zekavat, S. M.; Zinzuwadia, A.; Patel, K.; Short, S.; Mega, J.; Plowman, S.; Pagidipati, N. J.; Sullivan, S.; Mahaffey, K.; Shah, S. H.; Hernandez, A. F.; Christiani, D.; Aerts, H.; Weiss, J.; Lu, M. T.; Raghu, V.
Background: This study assessed whether deep learning applied to routine outpatient chest X-rays (CXRs) can identify individuals at high risk for incident chronic obstructive pulmonary disease (COPD). Methods: Using cancer screening trial data, we previously developed a convolutional neural network (CXR-Lung-Risk) to predict lung-related mortality from a CXR image. In this study, we externally validated CXR-Lung-Risk to predict incident COPD from routine CXRs. We identified outpatients without lung cancer, COPD, or emphysema who had a CXR taken from 2013-2014 at a Mass General Brigham site in Boston, Massachusetts. The primary outcome was 6-year incident COPD. Discrimination was assessed using AUC compared to the TargetCOPD clinical risk score. All analyses were stratified by smoking status. A secondary analysis was conducted in the Project Baseline Health Study (PBHS) to test associations between CXR-Lung-Risk and pulmonary function and protein abundance. Findings: The primary analysis consisted of 12,550 ever-smokers (mean age 62·4±6·8 years, 48.9% male, 12.4% rate of 6-year COPD) and 15,298 never-smokers (mean age 63·0±8·1 years, 42.8% male, 3.8% rate of 6-year COPD). CXR-Lung-Risk had additive predictive value beyond the TargetCOPD score for 6-year incident COPD in both ever-smokers (CXR-Lung-Risk + TargetCOPD AUC: 0·73 [95% CI: 0·72-0·74] vs. TargetCOPD alone AUC: 0·66 [0·65-0·68], p<0·01) and never-smokers (CXR-Lung-Risk + TargetCOPD AUC: 0·70 [0·67-0·72] vs. TargetCOPD AUC: 0·60 [0·57-0·62], p<0·01). In secondary analyses of 2,097 individuals in the PBHS, CXR-Lung-Risk was associated with worse pulmonary function and with abundance of SCGB3A2 (secretoglobin family 3A member 2) and LYZ (lysozyme), proteins involved in pulmonary physiology. Interpretation: In external validation, a deep learning model applied to a routine CXR image identified individuals at high risk for incident COPD, beyond known risk factors. Funding: The Project Baseline Health Study and this analysis were funded by Verily Life Sciences, San Francisco, California. ClinicalTrials.gov Identifier: NCT03154346
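For readers wanting the shape of the "additive predictive value" comparison (an image-derived score added to a clinical score, compared by AUC), here is a minimal sketch on simulated data; the variable names, effect sizes, and logistic-regression combination are assumptions, not the study's pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 5000
    clinical_score = rng.normal(size=n)                      # stand-in for TargetCOPD
    image_score = 0.5 * clinical_score + rng.normal(size=n)  # stand-in for CXR-Lung-Risk
    logit = -2.0 + 0.8 * clinical_score + 0.6 * image_score
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))            # simulated 6-year incident COPD

    X = np.column_stack([clinical_score, image_score])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    clinical_only = LogisticRegression().fit(X_tr[:, [0]], y_tr)
    combined = LogisticRegression().fit(X_tr, y_tr)
    print("clinical-only AUC:", roc_auc_score(y_te, clinical_only.predict_proba(X_te[:, [0]])[:, 1]))
    print("clinical + image AUC:", roc_auc_score(y_te, combined.predict_proba(X_te)[:, 1]))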
Aghlmandi, S.; Shafiezadeh, S.; Huber, C.; Godet, P.; Bucher, H. C.; Bielicki, J. A.
Objectives: To evaluate whether machine learning (ML) applied to comprehensive claims data without diagnostic codes can distinguish a high proportion of antibiotic treatment episodes as urinary tract infection (UTI) or non-UTI cases. Such approaches may be valuable for antimicrobial stewardship when diagnosis-linked datasets are unavailable. Methods: Outpatient antibiotic prescription claims from three major Swiss insurers (2017-2020; ~40% of the Swiss population) were analyzed. Based on clinical input, specific constellations of claims codes (e.g. positive urine culture plus typical antibiotic) were a priori assigned as indicating UTI episodes, providing the reference classification. Predictors included sex, age group, comorbidity, and diagnostic tests ordered during the episode. Four ML classifiers were tested; performance and interpretability were evaluated, with XGBoost prioritized. Results: After cleaning and balancing, 38,982 records (19,491 UTI; 19,491 non-UTI) were included. XGBoost achieved an AUC of 0.94, accuracy of 87.6%, sensitivity of 79.2%, and specificity of 96.1%. Misclassification was asymmetric: 11% of non-UTI cases were labeled UTI, while 2% of UTI cases were misclassified as non-UTI. Diagnostics ordered were the strongest predictors, followed by female sex and older age. Conclusions: Even in the absence of diagnosis codes, ML applied to claims data can reliably identify UTI-related prescriptions. This supports the feasibility of claims-based surveillance tools for stewardship, while in parallel highlighting the need for scalable, low-burden approaches to improve direct diagnostic coding in routine data.
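A hedged sketch of the kind of claims-based XGBoost classifier and metrics described above, on simulated data; the four features are illustrative stand-ins for sex, age group, comorbidity and ordered diagnostics, not the study's feature set.

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 20_000
    X = np.column_stack([
        rng.integers(0, 2, n),   # female sex
        rng.integers(0, 8, n),   # age group
        rng.integers(0, 2, n),   # comorbidity flag
        rng.integers(0, 2, n),   # urine culture ordered during the episode
    ])
    logit = -2.0 + 1.5 * X[:, 3] + 0.8 * X[:, 0] + 0.1 * X[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))            # reference UTI / non-UTI label

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = XGBClassifier(n_estimators=200, max_depth=4).fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    pred = prob >= 0.5
    sensitivity = pred[y_te == 1].mean()
    specificity = (~pred[y_te == 0]).mean()
    print(f"AUC {roc_auc_score(y_te, prob):.2f}, "
          f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")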
Rabby, A. S. A.; Chaudhary, M. F. A.; Saha, P.; Sthanam, V.; Nakhmani, A.; Zhang, C.; Barr, R. G.; Bon, J.; Cooper, C. B.; Curtis, J. L.; Hoffman, E. A.; Paine, R.; Puliyakote, A. K.; Schroeder, J. D.; Sieren, J. C.; Smith, B.; Woodruff, P. G.; Reinhardt, J. M.; Bhatt, S. P.; Bodduluri, S.
Background: Approximately 70% of adults with chronic obstructive pulmonary disease (COPD) remain undiagnosed. Opportunistic screening using chest computed tomography (CT) scans, commonly acquired in clinical practice, may be used to improve COPD detection through simple, clinically applicable deep-learning models. We developed a lightweight convolutional neural network (COPDxNet) that utilizes minimally processed chest CT scans to detect COPD. Methods: We analyzed 13,043 inspiratory chest CT scans from COPDGene participants (9,675 standard-dose and 3,368 low-dose scans), which we randomly split into training (70%) and test (30%) sets at the participant level so that no individual contributed to both sets. COPD was defined by post-bronchodilator FEV1/FVC < 0.70. We constructed a simple, four-block convolutional model that was trained on pooled data and validated on the held-out standard- and low-dose test sets. External validation was performed using standard-dose CT scans from 2,890 SPIROMICS participants and low-dose CT scans from 7,893 participants in the National Lung Screening Trial (NLST). We evaluated performance using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, Brier scores, and calibration curves. Findings: On COPDGene standard-dose CT scans, COPDxNet achieved an AUC of 0.92 (95% CI: 0.91 to 0.93), sensitivity of 80.2%, and specificity of 89.4%. On low-dose scans, AUC was 0.88 (95% CI: 0.86 to 0.90). When the COPDxNet model was applied to external validation datasets, it showed an AUC of 0.92 (95% CI: 0.91 to 0.93) in SPIROMICS and 0.82 (95% CI: 0.81 to 0.83) in NLST. The model was well calibrated, with Brier scores of 0.11 for standard-dose and 0.13 for low-dose CT scans in COPDGene, 0.12 in SPIROMICS, and 0.17 in NLST. Interpretation: COPDxNet demonstrates high discriminative accuracy and generalizability for detecting COPD on standard- and low-dose chest CT scans, supporting its potential for clinical and screening applications across diverse populations.
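As an illustration of what a "simple, four-block convolutional model" can look like, here is a hedged PyTorch sketch; the layer widths, input size and pooling choices are assumptions, not the published COPDxNet architecture.

    import torch
    import torch.nn as nn

    class FourBlockCNN(nn.Module):
        """Illustrative four-block CNN producing one logit (COPD vs no COPD)."""
        def __init__(self, in_channels: int = 1):
            super().__init__()
            def block(c_in, c_out):
                return nn.Sequential(
                    nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.BatchNorm2d(c_out),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(2),
                )
            self.features = nn.Sequential(
                block(in_channels, 16), block(16, 32), block(32, 64), block(64, 128))
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

        def forward(self, x):
            return self.head(self.features(x))   # raw logit; apply sigmoid for probability

    model = FourBlockCNN()
    logits = model(torch.randn(2, 1, 256, 256))   # e.g. two minimally processed CT slices
    print(logits.shape)                           # torch.Size([2, 1])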
Petersen, L. A.; Beck, M. S.; Xu, J. J.; Andersen, M. B.; Bruun, F. J.
Aim: The aim of this study was to test whether open-source Large Language Models (LLMs) can match the diagnostic accuracy of proprietary models in annotating Danish trauma radiology reports across three clinical findings. Materials and Methods: This retrospective study included 2,939 radiology reports of trauma radiographs collected from three Danish emergency departments. The data were split, with 600 cases for prompt engineering and 2,339 for model evaluation. Eight LLMs (GPT-4o and GPT-4o-mini from OpenAI, and six Llama3 variants from Meta) were prompted to annotate the reports for fractures, effusions, and luxations. The reference standard was human annotation. Diagnostic performance was assessed using accuracy, sensitivity, specificity, PPV, and NPV with 95% confidence intervals. Results: Prompt engineering improved the match score for Llama3-8b from 77.8% (95% CI: 74.4%-81.1%) to 94.3% (95% CI: 92.5%-96.2%). GPT-4o achieved the highest overall diagnostic accuracy at 97.9% (95% CI: 97.3%-98.5%), followed by Llama3.1-405b (97.1%; 95% CI: 96.4%-97.8%), GPT-4o-mini (96.9%; 95% CI: 96.2%-97.6%), Llama3-8b (96.9%; 95% CI: 95.9%-97.3%), and Llama3.1-70b (96.0%; 95% CI: 95.2%-96.8%). Across the three specific findings, all models performed best for fractures, whereas effusion and luxation were more prone to errors. Of the error types, semantic confusion was the most frequent, accounting for 53.2% to 59.4% of misclassifications. Conclusion: Small, open-source LLMs can accurately annotate Danish trauma radiology reports when supported by effective prompt engineering, achieving accuracy levels that rival proprietary competitors. They offer a viable, privacy-conscious alternative for clinical use, even in a low-resource language setting.
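A hedged sketch of the annotation setup described above: an instruction prompt asks a model to label a report for the three findings, and the JSON reply is parsed to binary labels. The prompt wording is invented, and call_llm is a placeholder for whichever chat API or local Llama3 runtime is used, not a real library call.

    import json

    PROMPT_TEMPLATE = (
        "You are annotating Danish trauma radiology reports. Read the report and answer "
        'with JSON only, e.g. {{"fracture": true, "effusion": false, "luxation": false}}.'
        "\n\nReport:\n{report}"
    )

    def annotate(report: str, call_llm) -> dict:
        """call_llm: placeholder callable that sends a prompt and returns the model's text."""
        reply = call_llm(PROMPT_TEMPLATE.format(report=report))
        labels = json.loads(reply)
        return {k: bool(labels.get(k, False)) for k in ("fracture", "effusion", "luxation")}

    def accuracy(predictions: list, reference: list, finding: str) -> float:
        """Share of reports where the model's label for one finding matches the human label."""
        hits = sum(p[finding] == r[finding] for p, r in zip(predictions, reference))
        return hits / len(reference)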
Morgan, A.; Contreras, E.; Yasuda, M.; Dutta, S.; Hamel, D. J.; Shankar, T.; Balallo, D.; Riedel, S.; Kirby, J. E.; Kanki, P. J.; Arnaout, R.
Background: Regulatory approval of new over-the-counter tests for infectious agents such as SARS-CoV-2 has historically required that clinical trials include diverse groups of specific patient populations, making the approval process slow and expensive. Showing that populations do not differ in their viral loads--the key factor determining test performance--could expedite the evaluation of new tests. Methods: 46,726 RT-qPCR-positive SARS-CoV-2 viral loads were annotated with patient demographics and health status. Real-world performance of two commercially available antigen tests was evaluated over a wide range of viral loads. An open-access web portal was created allowing comparisons of viral-load distributions across patient groups and application of antigen-test performance characteristics to patient distributions to predict antigen-test performance on these groups. Findings: In several cases distributions were surprisingly similar where a difference was expected (e.g. smokers vs. non-smokers); in other cases there was a difference in the opposite direction from expectations (e.g. higher in patients who identified as White vs. Black). Sensitivity and specificity of antigen tests for detecting contagiousness were similar across most groups. The portal is at https://arnaoutlab.org/coviral/. Conclusions: In silico analyses of large-scale, real-world clinical data repositories can serve as a timely evidence-based proxy for dedicated trials of antigen tests for specific populations. Free availability of richly annotated data facilitates large-scale hypothesis generation and testing. Funding: Funded by the Reagan-Udall Foundation for the FDA (RA and JEK) and via a Novel Therapeutics Delivery Grant from the Massachusetts Life Sciences Center (JEK).
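The portal's core calculation, as described above, is to push a group's viral-load distribution through a viral-load-dependent antigen-test sensitivity curve. Below is a minimal sketch with simulated distributions; the logistic curve parameters are assumptions.

    import numpy as np

    def predicted_sensitivity(log10_viral_loads, sens_curve):
        """Expected antigen-test sensitivity for a group: average the per-viral-load
        detection probability over that group's viral-load distribution."""
        return float(np.mean([sens_curve(v) for v in log10_viral_loads]))

    # Illustrative viral-load-to-detection curve (midpoint and slope are assumptions)
    sens_curve = lambda v: 1.0 / (1.0 + np.exp(-2.0 * (v - 5.0)))

    rng = np.random.default_rng(1)
    group_a = rng.normal(6.0, 1.5, 10_000)   # log10 viral loads, e.g. smokers
    group_b = rng.normal(6.1, 1.5, 10_000)   # e.g. non-smokers, nearly identical distribution
    print(predicted_sensitivity(group_a, sens_curve), predicted_sensitivity(group_b, sens_curve))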
Fisher, G. R.
In previous work, we achieved state-of-the-art performance on ChestX-ray14 (ROC-AUC 0.940, F1 0.821) using pretraining diversity and clinical metric optimization. Applying the same methodology to CheXpert, we obtained similar results when using NLP-labeled validation and test data--but when evaluated against expert radiologist labels, performance was only 0.75-0.87 ROC-AUC. The models had learned to match the automated NLP labeling system, not to diagnose disease. This paper documents our investigation into this failure and our suggested resolution. We identify the NLP-to-Expert generalization gap: a systematic divergence between models optimized on labels extracted from radiology reports and their agreement with board-certified radiologists. More surprisingly, we discovered that directly optimizing for small expert-labeled validation sets can be counterproductive--models with lower validation scores often generalized better to held-out expert test data. Four findings emerged. First, expert-labeled images for at least the validation and testing datasets, even if not for training, were vital in revealing the gap between NLP agreement and diagnostic accuracy. Without them, our models appeared excellent while failing to generalize to clinical judgment. Second, less training is better. Short training (1-5 epochs) outperformed extended training (60+ epochs) because longer training doesn't improve the model--it memorizes the labeler's mistakes. Third, ImageNet features are sufficient. Freezing the pretrained backbone and training only the classifier achieved 0.891 ROC-AUC--matching models with full fine-tuning. The rapid convergence we observed wasn't the model learning chest X-ray features; it was the classifier calibrating to already-sufficient visual representations. Fourth, regularization beats optimization. Label smoothing and frozen backbones--methods that prevent overfitting--outperformed direct metric optimization on small validation sets. The 200 expert-labeled validation images in CheXpert are too few to optimize directly; they are better used as a compass than a target. With these insights, we improved from 0.823 to 0.917 ROC-AUC, exceeding Stanford's official baseline (0.907).
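A hedged sketch of the "regularization beats optimization" recipe described above, a frozen ImageNet backbone plus label smoothing, in PyTorch; the label count, smoothing strength and training-step details are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_LABELS = 14    # CheXpert-style multi-label head (illustrative)
    SMOOTH_EPS = 0.1   # label-smoothing strength (assumption)

    backbone = models.densenet121(weights="IMAGENET1K_V1")   # downloads ImageNet weights
    for p in backbone.parameters():                          # freeze the pretrained features
        p.requires_grad = False
    backbone.classifier = nn.Linear(backbone.classifier.in_features, NUM_LABELS)

    optimizer = torch.optim.Adam(backbone.classifier.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()

    def smoothed(targets: torch.Tensor, eps: float = SMOOTH_EPS) -> torch.Tensor:
        """Soften hard 0/1 labels so the classifier cannot perfectly fit noisy NLP labels."""
        return targets * (1.0 - eps) + 0.5 * eps

    # one illustrative training step on a dummy batch
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 2, (8, NUM_LABELS)).float()
    optimizer.zero_grad()
    loss = criterion(backbone(images), smoothed(labels))
    loss.backward()
    optimizer.step()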
Sharma, N.; Ng, A. Y.; James, J. J.; Khara, G.; Ambrozay, E.; Austin, C. C.; Forrai, G.; Glocker, B.; Heindl, A.; Karpati, E.; Rijken, T. M.; Venkataraman, V.; Yearsley, J. E.; Kecskemethy, P. D.
Importance: Screening mammography with two human readers increases cancer detection and lowers recall rates, but workforce shortages make double reading unsustainable in many countries. Artificial intelligence (AI) as an independent reader in double reading may support screening performance while improving cost-effectiveness. The clinical validation of AI requires large-scale, multi-vendor studies on unenriched cohorts. Objective: To evaluate the performance of the Mia® AI system on data that the AI system would process in real-world deployments. Design: A retrospective study simulating the impact of AI on an unenriched screening sample. Setting: Seven European breast screening sites representing four centers: three from the UK and one in Hungary (HU), between 2009 and 2019. Participants: The sample included 275,900 cases (177,882 participants) from seven screening sites, involving two countries and four hardware vendors from 2009 to 2019. Intervention: Simulation of double reading using AI as an independent reader in breast cancer screening on historical data. Main Outcomes and Measures: Performance was determined for standalone AI compared to the historical single reader and for simulated double reading with AI compared to historical double reading, assessing non-inferiority and superiority on relevant screening metrics using a non-inferiority margin of 10% relative difference and a one-sided alpha of 2.5% for both tests. Results: Standalone AI detected 29.8% of missed interval cancers. When compared with historical double reading, double reading with AI showed non-inferiority for sensitivity and superiority for recall rate, specificity and positive predictive value. AI as an independent reader reduced the workload for the second human reader but increased the arbitration rate from 3.3% to 12.3%. Applying the AI system could have reduced the human reading time required by up to 44.8% and reduced the recall rate by a relative 7.7% (from 5.2% to 4.8%). Conclusions and Relevance: Using the AI system as an independent reader maintains or improves the double reading standard of care, while substantially reducing the workload. Thus, it has the potential to provide operational and economic benefits. Trial Registration: Registered on ISRCTN, study ID: ISRCTN18056078
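A hedged sketch of the non-inferiority logic stated above (10% relative margin, one-sided alpha of 2.5%) applied to sensitivity: non-inferiority holds if the lower bound of the relative difference stays above -10%. The paired bootstrap and all counts are simulated illustrations, not the study's statistical analysis.

    import numpy as np

    rng = np.random.default_rng(0)
    n_cancers = 2000
    historical = rng.binomial(1, 0.86, n_cancers)   # cancer detected by historical double reading
    ai_assisted = rng.binomial(1, 0.86, n_cancers)  # cancer detected by simulated AI double reading

    rel_diffs = []
    for _ in range(2000):                           # paired bootstrap over cancer cases
        idx = rng.integers(0, n_cancers, n_cancers)
        rel_diffs.append(ai_assisted[idx].mean() / historical[idx].mean() - 1.0)

    lower_bound = np.percentile(rel_diffs, 2.5)     # one-sided 2.5% alpha
    print(f"relative sensitivity difference, lower bound: {lower_bound:.3f}")
    print("non-inferior at 10% margin:", lower_bound > -0.10)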
Marfil, M.; Bonmati, L. M.
Background: Most mammographic AI stops at a binary cancer decision and ignores the histopathologic information that guides therapy. Purpose: To build an end-to-end virtual-biopsy pipeline that simultaneously segments tumours and predicts pT, pN, cM and molecular subtype across scanners and countries. Materials and Methods: A residual U-Net was trained on 8,469 public mammograms and externally tested on 200 cases from six ChAImeleon sites. From each mask we derived 1,559 radiomics features, 5,120 RadImageNet embeddings and 12 clinical covariates; a 5×3 nested cross-validation tuned an RBF-SVM/XGBoost/MLP ensemble with a random-forest meta-learner. Results: External Dice was 0.458; adding benign masks cut false-positive pixels by 19% but left Dice unchanged (0.455).
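For reference, the external Dice value quoted above is the standard overlap score between predicted and reference tumour masks; a minimal sketch (not the authors' implementation):

    import numpy as np

    def dice(pred_mask: np.ndarray, true_mask: np.ndarray, eps: float = 1e-7) -> float:
        """Dice similarity coefficient between two binary masks."""
        pred, true = pred_mask.astype(bool), true_mask.astype(bool)
        intersection = np.logical_and(pred, true).sum()
        return float((2.0 * intersection + eps) / (pred.sum() + true.sum() + eps))

    # toy example: two 4x4 masks that overlap on half of the predicted pixels
    pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True
    true = np.zeros((4, 4), dtype=bool); true[1:3, :] = True
    print(dice(pred, true))   # 0.5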
Plumb, A. A.; Obaro, A. E.; Bassett, P.; Baldwin-Cleland, R.; Halligan, S.; Burling, D.
Background: Colorectal cancer (CRC) is a common, important healthcare priority, and improving patient outcome relies on early diagnosis. Colonoscopy and computed tomographic colonography (CTC) are commonly used diagnostic tests. Although colonoscopists are highly regulated and must be accredited, no analogous process exists for CTC. There are currently no universally accepted radiologist performance indicators for CTC, and lack of regulatory oversight may lead to variability in quality and lower neoplasia detection rates. This study aims to determine whether a structured educational training and feedback programme can improve radiologist interpretation accuracy. Methods: NHS England CTC reporting radiologists will be cluster randomised to either an intervention (one-day individualised training and assessment with feedback) or control (assessment with no training or feedback) arm. Each cluster represents radiologists reporting CTC in a single NHS site. Both arms will undertake four CTC assessments: at baseline, at 1 month (after training in the intervention arm, or after enrolment in the control arm), and at 6 and 12 months, to assess their detection of colorectal cancer and 6 mm+ polyps. The primary outcome will be the difference in sensitivity at the 1-month test between arms. Secondary outcomes will include sensitivity at 6 and 12 months and radiologist characteristics associated with improved performance. Multilevel logistic regression will be used to analyse per-polyp and per-case sensitivity. Local ethical and Health Research Authority approval have been obtained. Discussion: Lack of infrastructure to ensure that CTC radiologists can report adequately, and lack of consensus regarding appropriate quality metrics, may lead to variability in performance. Our provision of a structured education programme with feedback will evaluate the impact of individualised training and identify the factors related to improved radiologist performance in CTC reporting. An improvement in performance could lead to increased neoplasia detection and better patient outcomes. Registration: ClinicalTrials.gov Identifier: NCT02892721; available from: https://clinicaltrials.gov/ct2/show/NCT02892721. NIHR Clinical Research Network (CPMS ID 32293).
Farquhar, H.
Background: Foundation models have emerged as a promising paradigm for medical imaging AI [7], with claims of improved generalization and reduced bias. However, their robustness to technical acquisition parameters remains unexplored. We evaluated whether foundation models exhibit greater robustness to chest radiograph view type (anteroposterior [AP] versus posteroanterior [PA]) compared to traditional convolutional neural networks. Methods: We compared four model architectures on the RSNA Pneumonia Detection Challenge dataset (n=26,684 images) and externally validated on the NIH ChestX-ray14 dataset (n=112,120 images): DenseNet-121 (supervised CNN), BiomedCLIP (vision-language model trained on 15 million biomedical image-text pairs), RAD-DINO (self-supervised model trained on 5+ million radiographs), and CheXzero (vision-language model trained on MIMIC-CXR chest radiographs). The primary outcome was the sensitivity gap between AP and PA views, with bootstrap confidence intervals and permutation testing. Results: On RSNA, CheXzero showed the smallest gap (14.3%, 95% CI: 11.2-17.5%), followed by RAD-DINO (25.2%, 22.6-27.9%), DenseNet-121 (35.7%, 32.9-38.7%), and BiomedCLIP (36.1%, 33.5-39.0%). However, on external validation (NIH), model rankings reversed completely: RAD-DINO demonstrated the smallest gap (22.3%, 95% CI: 21.0-23.6%), while CheXzero's gap increased dramatically to 48.9% (95% CI: 47.7-50.1%). Domain-specific training provided robustness within the training domain but failed to generalize. On PA-view pneumonia cases in NIH, 31% were missed by all four models, representing a systematic blind spot. View type explained 61-100% of performance variance across models on both datasets, compared to 0-38% for age and less than 4% for sex. Conclusions: Foundation models do not eliminate technical acquisition parameter biases in chest X-ray AI. While domain-specific training (CheXzero) provided superior robustness on internal validation, this advantage collapsed on external data. Self-supervised learning (RAD-DINO) demonstrated the most generalizable robustness, with consistent view-type gap stability across datasets with different labeling schemes (25.2% → 22.3%, despite substantial AUC differences). These findings challenge assumptions about foundation model generalization and highlight the need for acquisition parameter auditing in AI regulatory frameworks and multi-site external validation for robustness claims.
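A hedged sketch of the primary-outcome machinery named above, the AP-versus-PA sensitivity gap with a bootstrap confidence interval and a permutation test, on simulated labels; none of this is the authors' code and the simulated effect size is arbitrary.

    import numpy as np

    def sensitivity(y_true, y_pred):
        pos = y_true == 1
        return (y_pred[pos] == 1).mean()

    def view_gap(y_true, y_pred, is_ap):
        """Sensitivity on AP views minus sensitivity on PA views."""
        return sensitivity(y_true[is_ap], y_pred[is_ap]) - sensitivity(y_true[~is_ap], y_pred[~is_ap])

    def bootstrap_ci(y_true, y_pred, is_ap, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y_true)
        gaps = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)
            gaps.append(view_gap(y_true[idx], y_pred[idx], is_ap[idx]))
        return np.percentile(gaps, [2.5, 97.5])

    def permutation_p(y_true, y_pred, is_ap, n_perm=2000, seed=0):
        rng = np.random.default_rng(seed)
        observed = abs(view_gap(y_true, y_pred, is_ap))
        hits = sum(abs(view_gap(y_true, y_pred, rng.permutation(is_ap))) >= observed
                   for _ in range(n_perm))
        return (hits + 1) / (n_perm + 1)

    rng = np.random.default_rng(0)
    y_true = rng.binomial(1, 0.2, 3000)
    is_ap = rng.binomial(1, 0.5, 3000).astype(bool)
    # simulated predictions that are more sensitive on AP than on PA views
    y_pred = y_true * np.where(is_ap, rng.binomial(1, 0.8, 3000), rng.binomial(1, 0.5, 3000))
    print(view_gap(y_true, y_pred, is_ap), bootstrap_ci(y_true, y_pred, is_ap),
          permutation_p(y_true, y_pred, is_ap))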
Quill, S.; Hingorani, A. D.; Chaturvedi, N.; Schmidt, A. F.
Background: Population cancer screening detects the presence of early-stage disease rather than assessing future disease risk. We evaluated whether widely implemented cardiovascular disease (CVD) risk models can predict 10-year cancer risk and compared them with a less widely used cancer risk model (QCancer). Methods: We evaluated four CVD prediction models: QRISK3, the Pooled Cohort Equations (PCE), SCORE2 and SCORE2-OP. All models were recalibrated using 20% of the UK Biobank (UKB) cohort and tested in the remainder, as well as in the Clinical Practice Research Datalink (CPRD). We gauged model performance using c-statistics for discrimination and evaluated the fidelity of calibration. We also identified the most influential risk factors in the QRISK3 model. Findings: In the UKB test set, the c-statistics for incident CVD ranged from 0·71 to 0·74 (11,022 events). All CVD models achieved a c-statistic of 0·63 for any cancer (23,010 events) and showed CVD-equivalent discrimination for gastro-oesophageal, liver and biliary tree, laryngeal, renal tract, and lung cancers (c-statistic range: 0·70-0·81). Overall, the discrimination of the CVD models was comparable to that of the QCancer models (median difference in c-statistic: -0·01, 95% CI -0·03 to 0·00). The recalibrated CVD models showed near-perfect calibration (median intercept 0·01, Q1-Q3 -0·05 to 0·03; slope 1·00, Q1-Q3 0·93 to 1·15). Performance in CPRD (393,658 cancer events) was similar: the median c-statistic, calibration intercept, and slope were 0·01 (95% CI 0·00 to 0·02), 0·05 (95% CI 0·02 to 0·17), and 0·04 (95% CI 0·01 to 0·15) higher, respectively, in CPRD than in UKB. After age, smoking status and systolic blood pressure were the most influential predictors of cancer risk. Interpretation: Widely implemented CVD prediction models perform similarly to the QCancer models in the prediction of incident cancers. They may be used to inform cancer prevention and guide risk-stratified monitoring. The recalibrated models are available through an API. Funding: Health Data Research UK, British Heart Foundation and UK Research and Innovation.
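A minimal sketch of the calibration intercept and slope assessment referred to above: the slope is the coefficient of the linear predictor in a logistic re-fit, and the intercept is estimated with the linear predictor as an offset. Data are simulated and statsmodels is assumed; this is not the study code.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    p_hat = np.clip(rng.beta(2, 18, 20_000), 1e-6, 1 - 1e-6)   # predicted 10-year risks
    y = rng.binomial(1, p_hat)                                  # outcomes consistent with p_hat
    lp = np.log(p_hat / (1 - p_hat))                            # linear predictor (logit of risk)

    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    intercept_fit = sm.GLM(y, np.ones((len(lp), 1)), family=sm.families.Binomial(), offset=lp).fit()
    print("calibration slope:", slope_fit.params[1])          # ~1.00 for a well-calibrated model
    print("calibration intercept:", intercept_fit.params[0])  # ~0.00 for a well-calibrated model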
Farquhar, H. L.
Artificial intelligence systems for chest radiograph interpretation are increasingly deployed in clinical practice, yet current fairness frameworks emphasize demographic subgroup analysis while the relative contribution of technical acquisition parameters to performance disparities remains poorly characterized. We conducted a multi-dataset validation study analyzing 138,804 chest radiographs from the RSNA Pneumonia Detection Challenge (n=26,684; 22.5% pneumonia prevalence) and NIH ChestX-ray14 (n=112,120; 1.3% prevalence) using five pre-trained DenseNet-121 models. We calculated sensitivity, specificity, and area under the receiver operating characteristic curve stratified by view type (anteroposterior versus posteroanterior), age group, and sex, with performance disparity analysis quantifying each factor's contribution to performance variation. View type dominated the total observed performance range in both datasets: 87% in RSNA and 69% in NIH. All five models demonstrated systematic posteroanterior view underdiagnosis, with miss rates of 30-78%. The odds ratio for missed diagnosis on posteroanterior versus anteroposterior views was 6.69 (95% CI: 5.79-7.72) in RSNA and 13.02 (95% CI: 11.62-14.59) in NIH. Analysis of 131,361 disease-free images demonstrated that view-type effects persist strongly even without disease (Cohen's d = 1.19-1.33), providing compelling evidence against the hypothesis that observed disparities reflect disease severity confounding rather than learned image characteristics. Age explained 5-30% of the total observed performance range depending on dataset, while sex consistently explained less than 2%. Technical acquisition parameters, specifically radiograph view type, dominate performance disparities in chest X-ray AI, substantially exceeding demographic factor contributions. These findings have immediate implications for regulatory frameworks: future FDA and EU AI Act guidance should explicitly mandate acquisition parameter auditing alongside demographic subgroup analysis. Author Summary: Artificial intelligence systems that interpret chest X-rays are being used in hospitals worldwide. There has been important work examining whether these systems perform fairly across different patient groups--for example, whether they work equally well for men and women, or for patients of different ages and races. We asked a different question: does the way the X-ray was taken affect how well AI systems perform? We found that the technical method used to acquire the image--specifically, whether the X-ray beam was directed from back to front (posteroanterior view, typical in outpatient settings) or front to back (anteroposterior view, typical in emergency and inpatient settings)--explained 69-87% of the variation in AI performance. In contrast, age explained only 5-30% and sex less than 2%. Most concerning, AI systems missed 30-78% of pneumonia cases in standing patients across all five systems we tested. This matters because current regulations focus on checking AI performance across demographic groups but do not require checking performance across technical acquisition parameters. Our findings suggest regulators and hospitals should audit how AI systems perform on different types of X-ray images, not just different types of patients.
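The missed-diagnosis odds ratios quoted above come from a standard 2x2 comparison of missed versus detected cases by view. A hedged sketch with placeholder counts (not the study's numbers), using the usual log-odds-ratio standard error for the 95% CI:

    import numpy as np

    # placeholder 2x2 counts: missed / detected pneumonia cases by view type
    missed_pa, detected_pa = 400, 600
    missed_ap, detected_ap = 100, 900

    odds_ratio = (missed_pa * detected_ap) / (missed_ap * detected_pa)
    se_log_or = np.sqrt(1 / missed_pa + 1 / detected_pa + 1 / missed_ap + 1 / detected_ap)
    ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
    print(f"OR for missed diagnosis, PA vs AP: {odds_ratio:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")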
Chaudhary, M. F. A.; Awan, H. A.; Gerard, S. E.; Bodduluri, S.; Comellas, A. P.; Barjaktarevic, I. Z.; Barr, R. G.; Cooper, C. B.; Galban, C. J.; Han, M. K.; Curtis, J. L.; Hansel, N. N.; Krishnan, J. A.; Menchaca, M. G.; Martinez, F. J.; Ohar, J.; Buonfiglio, L. G. V.; Paine, R.; Bhatt, S. P.; Hoffman, E. A.; Reinhardt, J. M.
Rationale: Quantifying functional small airways disease (fSAD) requires an additional expiratory computed tomography (CT) scan, limiting clinical applicability. Artificial intelligence (AI) could enable fSAD quantification from a chest CT scan at total lung capacity (TLC) alone (fSADTLC). Objectives: To evaluate an AI model for estimating fSADTLC and study its clinical associations in chronic obstructive pulmonary disease (COPD). Methods: We analyzed 2,513 participants from the SubPopulations and InteRmediate Outcome Measures in COPD Study (SPIROMICS). Using a subset (n = 1,055), we developed a generative model to produce virtual expiratory CTs for estimating fSADTLC in the remaining 1,458 SPIROMICS participants. We compared fSADTLC with dual-volume, parametric response mapping fSAD (fSADPRM). We investigated univariate and multivariable associations of fSADTLC with FEV1, FEV1/FVC, six-minute walk distance (6MWD), St. George's Respiratory Questionnaire (SGRQ), and FEV1 decline. The results were validated in a subset (n = 458) from the COPDGene study. Multivariable models were adjusted for age, race, sex, BMI, baseline FEV1, smoking pack-years, smoking status, and percent emphysema. Measurements and Main Results: Inspiratory fSADTLC was highly correlated with fSADPRM in the SPIROMICS (Pearson's R = 0.895) and COPDGene (R = 0.897) cohorts. In SPIROMICS, fSADTLC was associated with FEV1 (L) (adj. β = -0.034, P < 0.001), FEV1/FVC (adj. β = -0.008, P < 0.001), SGRQ (adj. β = 0.243, P < 0.001), and FEV1 decline (mL/year) (adj. β = -1.156, P < 0.001). fSADTLC was also associated with FEV1 (L) (adj. β = -0.032, P < 0.001), FEV1/FVC (adj. β = -0.007, P < 0.001), SGRQ (adj. β = 0.190, P = 0.02), and FEV1 decline (mL/year) (adj. β = -0.866, P = 0.001) in COPDGene. We found fSADTLC to be more repeatable than fSADPRM, with an intraclass correlation of 0.99 (95% CI: 0.98, 0.99) vs. 0.83 (95% CI: 0.76, 0.88). Conclusions: Inspiratory fSADTLC captures small airways disease as reliably as fSADPRM and is associated with FEV1 decline. Funding Source: This work was supported by NHLBI grants R01 HL142625, U01 HL089897 and U01 HL089856, by NIH contract 75N92023D00011, and by a grant from The Roy J. Carver Charitable Trust (19-5154). The COPDGene study (NCT00608764) has also been supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion.
Wiering, B.; Mounce, L. T.; Price, S. J.; Shotter, D.; Valderas, J. M.; Merriel, S. W.; Moore, S.; Farmer, L.; Von Wagner, C.; Payne, R. A.; Renzi, C.; Lyratzopoulos, G.; Hamilton, W.; Abel, G. A.
Background: Expediting cancer diagnosis is a priority in many countries. The rising prevalence of chronic conditions may complicate the cancer diagnostic process. We investigated whether patients with pre-existing morbidity were more likely to experience disadvantage in cancer diagnostic outcomes and processes. Methods: We used linked primary care, secondary care, and cancer registration data for patients aged 40+ years diagnosed with incident cancer in England during 2012-2018. The Cambridge Multimorbidity Score quantified morbidity burden. Logistic regressions investigated whether morbidity burden was associated with stage at diagnosis, 30-day all-cause mortality, and an emergency presentation or urgent suspected cancer referral route to diagnosis. Results: 288,297 patients were included. Decreasing morbidity burden was associated with an increased likelihood of advanced-stage diagnosis (e.g. high burden vs. no burden aOR: 0.72, 95% CI: 0.70-0.75, p<0.0001). There were U-shaped relationships between morbidity burden, emergency diagnoses and 30-day mortality, with those with high or no morbidity burden most likely to be diagnosed as an emergency and to die within 30 days after diagnosis. Diagnoses via urgent suspected cancer referrals decreased with increasing morbidity burden. Associations varied across cancer sites, though higher morbidity burden was not associated with advanced stage for any cancer. Conclusion: Contrary to expectations, not having pre-existing morbidities was associated with an increased risk of advanced-stage diagnosis and emergency presentations. This may reflect heightened surveillance of patients with morbidity being protective against later advanced-stage cancer diagnoses. These findings highlight the need for robust cancer surveillance processes and good comprehensive care that considers cancer alongside wider aspects of health.
Cameron, A. C.; Arnold, M.; Katsas, G.; Yang, J.; Quinn, T.; Abdul-Rahim, A. H.; Campbell, R.; Docherty, K.; De Marchis, G. M.; Arnold, M.; Kahles, T.; Nedeltchev, K.; Cereda, C.; Kaegi, G.; Bustamante, A.; Montaner, J.; Ntaios, G.; Foerch, C.; Spanaus, K.; Von Eckardstein, A.; Dawson, J.; Katan, M.
Background: Prolonged cardiac monitoring (PCM) increases atrial fibrillation detection after stroke (AFDAS), but access is limited. We aimed to assess the utility of midregional pro-atrial natriuretic peptide (MR-proANP) and N-terminal pro-B-type natriuretic peptide (NT-proBNP) to identify people who are unlikely to have AFDAS and improve healthcare resource allocation for PCM. Methods: We analysed people from the BIOSIGNAL (Biomarker Signature of Stroke Aetiology) study with ischaemic stroke, no known AF and ≥3 days of cardiac monitoring. External validation was in the PRECISE (Preventing Recurrent Cardioembolic Stroke: Right Approach, Right Patient) study of 28-day cardiac monitoring after stroke. The main outcome was no AFDAS. We assessed the discriminatory value of MR-proANP and NT-proBNP combined with clinical variables to identify people with no AFDAS. We determined the net reduction in people who would undergo PCM using the models with a 15% AFDAS threshold probability. Results: We included 621 people from BIOSIGNAL. The clinical model included age, National Institutes of Health Stroke Scale score, lipid-lowering therapy, creatinine and smoking status. The AUROC was 0.68 (95% CI 0.62-0.74) with clinical variables, which improved with log10MR-proANP (0.72, 0.66-0.78; p=0.001) or log10NT-proBNP (0.71, 0.65-0.77; p=0.009). Performance was similar for log10MR-proANP versus log10NT-proBNP (p=0.28). In 239 people from PRECISE, the AUROC for clinical variables was 0.68 (0.59-0.76), which improved with log10NT-proBNP (0.73, 0.65-0.82; p<0.001) or log10MR-proANP (0.79, 0.72-0.86; p<0.001). Performance was better with log10MR-proANP versus log10NT-proBNP (p=0.03). The models could reduce the number of people who would undergo PCM by 30% (clinical + log10MR-proANP), 27% (clinical + log10NT-proBNP) or 20% (clinical). Conclusions: MR-proANP and NT-proBNP help classify people who are unlikely to have AFDAS and could reduce the number who need PCM by 30%.
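One way to read the "net reduction in people who would undergo PCM" at a 15% threshold probability is the standard decision-curve quantity of tests avoided; whether the authors used exactly this formulation is an assumption. A sketch on simulated data:

    import numpy as np

    def net_reduction_in_tests(y_afdas, p_afdas, threshold=0.15):
        """Net proportion of PCM referrals avoided per patient, decision-curve style."""
        n = len(y_afdas)
        refer = p_afdas >= threshold                 # refer for PCM if predicted risk >= threshold
        tp = np.sum(refer & (y_afdas == 1)) / n
        fp = np.sum(refer & (y_afdas == 0)) / n
        w = threshold / (1 - threshold)
        nb_model = tp - fp * w                       # net benefit of model-guided referral
        nb_refer_all = y_afdas.mean() - (1 - y_afdas.mean()) * w
        return (nb_model - nb_refer_all) / w

    rng = np.random.default_rng(0)
    p_afdas = rng.beta(2, 10, 621)                   # simulated predicted AFDAS risks
    y_afdas = rng.binomial(1, p_afdas)               # simulated AFDAS outcomes
    print(f"net reduction in PCM: {net_reduction_in_tests(y_afdas, p_afdas):.0%}")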
Barrot, J.; Cayla, J. A.; Mata-Cases, M.; Real, J.; Vlacho, B.; Franch-Nadal, J.; Mauricio, D.; COVID-19 Working Group in Primary Health Care,
Objective: This study aimed to identify prognostic factors associated with poor outcomes of COVID-19 at diagnosis in Primary Health Care (PHC). Methods: We conducted a retrospective, longitudinal study using the SIDIAP database, part of the PHC Information System of Catalonia. The analysis included COVID-19 cases diagnosed in patients aged 18 and older from March 2020 to September 2022. Follow-up was conducted for 90 days post-diagnosis or until death. Various machine learning models of differing complexities were used to predict short-term events, including mortality and hospital complications. Each model was tailored to maximize the predictive accuracy for poor outcomes, exploring algorithms such as Generalized Linear Models, flexible GLMs with Lasso, Gradient Boosting Models, and Support Vector Machines, with the model demonstrating the highest Area Under the Curve (AUC) selected for optimal performance. Results: A total of 2,162,187 COVID-19 cases were identified across five epidemic waves. Key predictors of short-term complications included age and the epidemic wave. Additional significant factors encompassed social deprivation (MEDEA), blood pressure, cardiovascular history, chronic obstructive pulmonary disease (COPD), obesity, and diabetes mellitus. The models exhibited high performance, with AUC values ranging from 0.73 to 0.95. A web application was developed to estimate the risk of adverse outcomes based on individual patient profiles (https://dapcat.shinyapps.io/CovidScore). Conclusions: In addition to age and epidemic wave, predictors such as social deprivation, diabetes mellitus, obesity, COPD, cardiovascular disease, high blood pressure, and dyslipidemia significantly indicate poor prognosis in COVID-19 patients diagnosed in PHC, and the developed application facilitates risk quantification for individual patients.
Mitsuyama, Y.; Walston, S. L.; Takita, H.; Saito, K.; Ueda, D.
Purpose: To evaluate whether chest radiograph-derived age acceleration is associated with incident lung cancer and whether it improves discrimination beyond established lung cancer risk factors. Materials and Methods: This retrospective analysis used prospectively collected data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Baseline digitized chest radiographs from the initial screening year were analyzed using a previously validated deep learning model that estimates chest radiograph-derived age (Xp-age). Age acceleration (AgeAccel) was defined as the residual of Xp-age after calibration to chronological age using a regression model from the development dataset. A 1-year landmark design excluded participants diagnosed with lung cancer or censored within 1 year of baseline. Associations with incident lung cancer were assessed using multivariable Cox proportional-hazards models adjusted for prespecified demographic and clinical predictors, including smoking variables used in the PLCOm2012 risk prediction model. Discrimination was evaluated using the concordance index and 6-year time-dependent area under the receiver-operating-characteristic curve. Results: The analytic cohort included 23,213 participants (mean age, 62.5 years); 790 developed incident lung cancer after the landmark (mean follow-up, 16.7 years). Higher AgeAccel was associated with increased lung cancer incidence (hazard ratio, 1.10 per 1-SD increase; 95% confidence interval: 1.03-1.17); however, addition of AgeAccel to an established risk factor model resulted in minimal change in discrimination (C-index, 0.840 vs. 0.839; time-dependent AUC at 6 years, 0.852 vs. 0.852). Attribution maps emphasized the aortic arch/mediastinal region with similar spatial patterns across smoking and lung cancer strata. Conclusion: Chest radiograph-derived age acceleration was independently associated with future lung cancer incidence.
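A hedged sketch of the AgeAccel construction and Cox model described above: the residual of model-estimated radiograph age after regressing it on chronological age, standardised and entered into a Cox model. Data are simulated and the lifelines package is assumed; this is not the authors' pipeline.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n = 5000
    chron_age = rng.uniform(55, 75, n)
    xp_age = chron_age + rng.normal(0, 4, n)                 # stand-in for CXR-derived age

    slope, intercept = np.polyfit(chron_age, xp_age, 1)      # calibrate Xp-age to chronological age
    age_accel = xp_age - (intercept + slope * chron_age)     # residual = age acceleration
    age_accel_sd = (age_accel - age_accel.mean()) / age_accel.std()

    df = pd.DataFrame({
        "age_accel_sd": age_accel_sd,
        "age": chron_age,
        "time": rng.exponential(20, n),                      # simulated follow-up, years
        "event": rng.binomial(1, 0.05, n),                   # simulated incident lung cancer
    })
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    print(cph.hazard_ratios_["age_accel_sd"])                # HR per 1-SD increase in AgeAccel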
Budi Susilo, Y. K.; Abdul Rahman, S.; Amgain, K.; Yuliana, D.
Artificial intelligence (AI) is transforming precision medicine, particularly in cardiovascular disease prevention and management. This bibliometric analysis examines the research landscape from 2020 to 2024, focusing on AI's role in improving diagnostics, personalizing treatment, and advancing predictive healthcare. Using the PRISMA framework, VOSviewer, Harzing's Publish or Perish, and Excel, 137 articles from Scopus were systematically analyzed. The study reveals a significant surge in research activity, with 2024 marking a peak. Machine learning and deep learning are central to key advancements, enabling early detection and risk prediction. Contributions from leading institutions highlight the global and interdisciplinary nature of this field, with studies demonstrating AI's potential to integrate complex datasets and deliver tailored therapies. While AI-driven innovations show promise, challenges such as ethical concerns and healthcare disparities remain. This analysis underscores AI's transformative potential in precision medicine and identifies opportunities for equitable, collaborative advancements.
Bramley, R.; Sharman, A.; Duerden, R.; Lyon, S.; Ryan, M.; Weber, E.; Brown, L.; Evison, M.
Objective: This study aimed to establish a reproducible method for categorisation of the AI-detected chest X-ray (CXR) abnormalities that should be prioritised for urgent reporting to support faster lung cancer diagnosis. By selecting findings informed by cancer prevalence and clinical significance, we sought to maximise detection while maintaining a high negative predictive value (NPV). Materials and Methods: Two cohorts of CXRs were evaluated: (1) a retrospective cohort of patients with confirmed lung cancer and abnormal CXRs, and (2) a prospective cohort of primary care-referred CXRs from seven Greater Manchester trusts, with the AI system in shadow mode. The AI triage system (Annalise Enterprise CXR) evaluated the relative prevalence of 124 abnormalities, and prioritisation strategies were assessed using sensitivity, specificity, positive predictive value (PPV), and NPV. Results: A total of 1,282 lung cancer patients were included in cohort 1. In cohort 2, the AI system processed 13,802 CXRs. Sensitivity was 95.87% (94.77%-96.97%) in cohort 1, and specificity was 79.11% (78.43%-79.79%) in cohort 2, with an NPV of 99.95%. Conclusion: This study presents a systematic, reproducible method for prioritising AI-detected CXR abnormalities, balancing high sensitivity and NPV while minimising low-risk prioritisation. This approach provides a data-driven alternative to traditional methods relying solely on clinical judgement.
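For reference, the sensitivity, specificity, PPV and NPV quoted above all derive from a 2x2 table of AI prioritisation against confirmed lung cancer; a minimal sketch with placeholder counts (not the cohort data):

    def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
        """Standard screening metrics from a 2x2 confusion matrix."""
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        }

    print(screening_metrics(tp=95, fp=200, fn=5, tn=900))   # placeholder counts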